Geographical analysis of media flows
A multidimensional approach
Introduction
1 Corpus preparation
The aim of this section is to prepare a corpus of news related to a language and one or several countries over a period of time. As an example, we prepare a corpus of news in French (fr) related to France (FRA), Belgium (BEL) and Algeria (DZA) over a period of two years, from 1st January 2014 to 31st December 2015. As news items include not only titles but also descriptions, we decide to break the descriptions into sentences and keep a maximum of three sentences per news item, with a maximum of 100 tokens per sentence.
The data used in this example have been collected by the research project ANR Geomedia and are free to use for scientific and pedagogical purposes only. The content of the news should not be used or disseminated without the agreement of the newspapers.
1.1 Selection of media
We import the data provided by each media outlet and bind them into a single data.frame. Then we select the columns of interest.
# Load data with the function fread (fast) and the encoding UTF-8
df1<-fread("data/source/fr_FRA_figaro_int.csv", encoding = "UTF-8")
df1$media<-"fr_FRA_figaro"
df2<-fread("data/source/fr_FRA_libera_int.csv", encoding = "UTF-8")
df2$media<-"fr_FRA_libera"
df3<-fread("data/source/fr_BEL_derheu_int.csv", encoding = "UTF-8")
df3$media<-"fr_BEL_derheu"
df4<-fread("data/source/fr_BEL_lesoir_int.csv", encoding = "UTF-8")
df4$media<-"fr_BEL_lesoir"
df5<-fread("data/source/fr_DZA_elwata_int.csv", encoding = "UTF-8")
df5$media<-"fr_DZA_elwata"
df6<-fread("data/source/fr_DZA_xpress_int.csv", encoding = "UTF-8")
df6$media<-"fr_DZA_xpress"
# bind all media outlets in a single table
df<-rbind(df1,df2,df3,df4,df5,df6)
rm(df1,df2,df3,df4,df5,df6)
# select column of interest
df$id <- df$ID_Item
df$who <- df$media
df$when <- df$Date_Recup
df$text <- paste(df$Titre," . ", df$Description, sep="")
df<-df[,c("id","who","when","text")]
df<-df[order(when),]
# select period of interest
mintime<-as.Date("2014-01-01")
maxtime<-as.Date("2015-12-31")
df<-df[(is.na(df$when)==F),]
df<-df[as.Date(df$when) >= mintime,]
df<-df[as.Date(df$when) <= maxtime,]
# eliminate duplicate
df<-df[duplicated(df$text)==F,]
1.2 Check of time frequency
1.2.1 Time divisions
We transform the previous data.frame into a data.table format for easier aggregation operations.
dt<-as.data.table(df)
dt$day <- as.Date(dt$when)
dt$week <- cut(dt$day, "weeks", start.on.monday=TRUE)
dt$month <- cut(dt$day, "months")
dt$weekday <- weekdays(dt$day)
# Save data frame
saveRDS(dt,"data/corpus/dt_mycorpus.RDS")
1.2.2 News by week
We examine whether the weekly distribution of news is regular for the different media of the corpus.
dt<-readRDS("data/corpus/dt_mycorpus.RDS")
news_weeks<-dt[,.(newstot=.N),by=.(week,who)]
p<-ggplot(news_weeks, aes(x=as.Date(week),y=newstot, col=who))+
geom_line()+
geom_smooth(method = 'loess', formula = 'y~x')+
scale_y_continuous("Number of news", limits = c(0,NA)) +
scale_x_date("Week (starting on monday)") +
ggtitle(label ="Corpus : distribution of news by week",
subtitle = "1st Jan 2014 to 31st Dec. 2015")
p
1.2.3 News by weekday
We examine whether the distribution is regular by weekday and check in particular the effect of the weekend.
#compute frequencies by weekday
news_weekdays<-dt[,.(newstot=.N),by=.(weekday,who)]
news_weekdays<-news_weekdays[,.(weekday,newspct=100*newstot/sum(newstot)),by=.(who)]
# Translate weekdays in English and order them
# (the initial factor levels follow the French alphabetical order of the fr_FR locale:
# dimanche, jeudi, lundi, mardi, mercredi, samedi, vendredi)
news_weekdays$weekday<-as.factor(news_weekdays$weekday)
levels(news_weekdays$weekday)<-c("7.Sunday","4.Thursday","1.Monday","2.Tuesday","3.Wednesday","6.Saturday","5.Friday")
news_weekdays$wkd<-as.factor(as.character(news_weekdays$weekday))
news_weekdays<-news_weekdays[order(news_weekdays$weekday),]
p<-ggplot(news_weekdays, aes(x=weekday,fill = who, y=newspct))+
geom_bar(position = "dodge", stat="identity")+
scale_y_continuous("Share of news (%)", limits = c(0,NA)) +
ggtitle(label ="Corpus : distribution of news by week day",
subtitle = "1st Jan 2014 to 31st Dec. 2015")
p
1.3 Transform in quanteda corpus
1.3.1 Reshape news by sentences
The aim of this step is to harmonize the length of the texts collected through RSS. We decide to keep only the title of each news item and the first three sentences of the description when they are available. The result is stored in quanteda format.
Unfortunately, the division of texts into sentences performed by quanteda is far from perfect, owing to problems in the collection of the news. For example, the following text will be considered as a single sentence because the full stop is not followed by a blank character.
Le conflit est terminé.Mais la Russie est-elle d’accord avec la Turquie.
It is therefore necessary to clean the text with a regular expression that inserts a blank space " " after each full stop located between a lower-case and an upper-case character:
str_replace_all(txt, "(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
in order to obtain a text that will be recognised as made of two sentences.
Le conflit est terminé. Mais la Russie est-elle d’accord avec la Turquie.
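The effect of the substitution can be checked directly (a minimal sketch using stringr; the variable name `txt` is purely illustrative):

```r
library(stringr)

txt <- "Le conflit est terminé.Mais la Russie est-elle d’accord avec la Turquie."
# no space after the full stop: quanteda would see a single sentence
str_detect(txt, "terminé\\. Mais")            # FALSE
txt <- str_replace_all(txt, "(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
# the blank space is restored between the two sentences
str_detect(txt, "terminé\\. Mais")            # TRUE
```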
Some sentences appear too short or too long for a sound analysis. Therefore, we decide to eliminate outliers based on the quantiles of the number of tokens. In practice, we eliminate the sentences with more than 100 tokens or fewer than 3 tokens.
t1<-Sys.time()
dt<-readRDS("data/corpus/dt_mycorpus.RDS")
# clean sentences break (long !)
dt$text<-str_replace_all(dt$text,"(?<=[:lower:])\\.(?=[:upper:])", "\\. ")
# transform in quanteda
qd<-corpus(dt,docid_field = "id",text_field = "text")
# break in sentences
qd<-corpus_reshape(qd,to="sentences", use_docvars=T)
# Identify rank of sentences
qd$order<-as.numeric(as.data.frame(str_split(names(qd),"\\.", simplify=T))[,2])
# Select only title + maximum of 3 sentences
qd<-corpus_subset(qd, order < 5)
# filter by number of tokens by sentence
qd$nbt<-ntoken(texts(qd))
#mintok<-quantile(qd$nbt,0.01)
#maxtok<-quantile(qd$nbt,0.99)
#qd<-corpus_subset(qd, nbt>mintok)
qd<-corpus_subset(qd, nbt<100)
qd<-corpus_subset(qd, nbt>2)
# Save corpus in qd format
saveRDS(qd,"data/corpus/qd_mycorpus.RDS")
t2<-Sys.time()
paste("Program executed in ", t2-t1)
head(qd)
summary(qd,3)
1.3.2 Number of sentences by media
We check the number of sentences available for the title (order 1) and for each rank of sentence in the description (order 2 to 4).
qd<-readRDS("data/corpus/qd_mycorpus.RDS")
x<-data.table(docvars(qd))
tab<-x[,.(tot=.N),by=.(who,order)]
tab<-dcast(tab,order~who)
tab$order<-as.factor(tab$order)
levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
kable(tab, caption = "Distribution of title and sentences by media")
| order | fr_BEL_derheu | fr_BEL_lesoir | fr_DZA_elwata | fr_DZA_xpress | fr_FRA_figaro | fr_FRA_libera |
|---|---|---|---|---|---|---|
| Title | 6994 | 10815 | 2896 | 4794 | 9449 | 13703 |
| Sent1 | 6962 | 10730 | 2925 | 4755 | 9423 | 11227 |
| Sent2 | 1884 | 3517 | 2893 | 249 | 3591 | 3453 |
| Sent3 | 518 | 995 | 2867 | 16 | 536 | 444 |
1.3.3 Size of texts by month
We visualize the distribution of sentences of different orders through time, in order to prepare a decision on the length of text to be kept.
tab<-x[,.(tot=.N),by=.(month,order)]
tab$month<-as.Date(tab$month)
tab$order<-as.factor(tab$order)
levels(tab$order)<-c("Title","Sent1","Sent2","Sent3")
p<-ggplot(tab, aes(x=month,fill = order, y=tot))+
geom_bar(stat="identity")+
ggtitle(label ="Corpus : distribution of titles and sentences by month",
subtitle = "1st Jan 2014 to 31st Dec. 2015")
p
4 Hypercubes
This section is based on the TELEMAC application elaborated during the H2020 project ODYCCEUS, presented in a paper published in the journal Frontiers and available at https://analytics.huma-num.fr/Claude.Grasland/telemac/
Our objective is to build a hypercube organised along different dimensions. As an example, suppose that we are interested in the analysis of the migrant and refugee crisis (what) in different newspapers (who) at different periods of time (when), and that we want to explore which countries are mentioned (where) and possibly associated together (where1.where2). Finally, we want to distinguish, inside the news, the possible changes of results if we consider the title or the first, second and third sentences of the description (order).
4.1 Definition of dimensions
To illustrate these different options, we can look at the example of a news item published by the Algerian newspaper El Watan on 16 September 2015 and divided into a title and three sentences of description.
qd<-readRDS("data/corpus/qd_mycorpus_geo_top.RDS")
examp<-corpus_subset(qd,docid(qd) == 9486265)
kable(paste(examp))
| x |
|---|
| Crise des réfugiés en Europe : Vers un conseil des chefs d’Etat et de gouvernement de l’UE . |
| L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire. |
| Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP. |
| «C’est un problème pour l’Union européenne dans son ensemble, c’est pourquoi nous nous sommes prononcés pour la tenue la semaine prochaine d’un conseil extraordinaire de l’UE», a déclaré la chancelière allemande lors d’une conférence de presse avec son homologue autrichien Werner Faymann. |
Thanks to the previous operations of geographical and topical tagging, we can propose a simplified table where the text of the news has been removed and where we keep only the information of interest for the aggregation procedure.
examp$id<-as.character(docid(examp))
dtexamp<-data.table(tidy(examp)) %>% select(id=id, order = order, who = who, when=day, what=mobil, where1 = states, where2=states)
kable(dtexamp)
| id | order | who | when | what | where1 | where2 |
|---|---|---|---|---|---|---|
| 9486265 | 1 | fr_DZA_elwata | 2015-09-16 | refu | ||
| 9486265 | 2 | fr_DZA_elwata | 2015-09-16 | migr | DEU AUT SVK | DEU AUT SVK |
| 9486265 | 3 | fr_DZA_elwata | 2015-09-16 | refu migr | BEL DEU AUT SVK | BEL DEU AUT SVK |
| 9486265 | 4 | fr_DZA_elwata | 2015-09-16 |
The hypercube is the result of an aggregation of foreign news according to several dimensions:
who : this dimension corresponds to the variable describing the media outlet which published the RSS feed. Each source is identified by a code ll_sss_xxxxxx where ll is the language, sss is the ISO3 code of the country and xxxxxx the name of the media. For instance, the RSS feed produced by the Algerian newspaper El Watan is identified by the code who = fr_DZA_elwata. Starting from there, it is possible to aggregate the data by group of languages (e.g. compute the indicators for all the French-speaking newspapers) or by country (compute the indicators for all the media outlets located in Algeria).
when : this dimension describes the day when an article of the RSS feed was published, according to a reference time zone (Paris in the present case). Starting from the day, the data can be further aggregated into larger periods: weeks, months, quarters or years. For instance, if we choose to work on monthly aggregated data, the period of observation for the news item presented as example will be when = 2015-09-01. If we choose a division in weeks, we have to decide whether the week starts on Sunday (default option of R) or on Monday (option adopted in the present case).
where1 and where2 : this dual dimension is associated to the cross-list of foreign countries detected in the news by the country dictionary. For example, the first sentence of the description of our example (“L’Allemagne, l’Autriche et la Slovaquie ont appelé, hier, à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire.”) produces a list of three places (DEU, AUT, SVK) associated to a cross-list of nine couples of places (AUT-AUT, AUT-DEU, AUT-SVK, DEU-DEU, DEU-AUT, DEU-SVK, SVK-AUT, SVK-DEU, SVK-SVK), where each couple receives a weight of 1/9. It is important to keep in mind that the country where the media outlet is located (mentioned in the who dimension) should be excluded from the list if we decide to work only on foreign news.
what : in general, this dimension is a boolean value (TRUE/FALSE) which indicates whether the news is associated to the topic of interest. For example, the title and the first two sentences of our example are associated to the topic of international mobility, but not the third sentence, where the expected keywords have not been found. If subtopics have been introduced, the situation is more complex because the news can be associated to several subtopics (just as it can be associated to several states). For example, the second sentence of the description (“Après l’échec lundi de la réunion extraordinaire à Bruxelles des ministres de l’Intérieur de l’Union européenne (UE) sur la répartition des réfugiés par quotas, l’Allemagne, l’Autriche et la Slovaquie ont appelé hier à la tenue, dès la semaine prochaine, d’un conseil européen des chefs d’Etat et de gouvernement consacré à la crise migratoire, rapporte l’AFP”) is associated to 2 subtopics (refu, migr) and 4 countries (BEL, AUT, DEU, SVK). It will therefore be broken into 2 x 4 x 4 = 32 pieces of information, each of them associated to a weight of 1/32.
order : to build the hypercube, it is possible to work on text units of different sizes: order = 1 keeps only the title, while order = 2, 3, 4, ... keeps the title plus the selected number of sentences of the description when available. This parameter is important because some results, especially regarding the spatial dimension of the analysis (where), are more noticeable on longer texts. In our example, the conclusions would clearly be different if we had decided to focus only on the title, which does not mention any country and is only associated to the subtopic of refugees.
4.2 Aggregation function
The elaboration of the hypercube is based on the crossing of all dimensions, with one line for each singular combination. To do that, we have elaborated a specific function that combines all six dimensions but can easily be adapted if fewer dimensions are needed.
#' @title create an hypercube
#' @name hypercube
#' @description create a network of interlinked states
#' @param corpus a corpus of news in quanteda format
#' @param order an order of sentences in the news
#' @param who the source dimension
#' @param when the time dimension
#' @param timespan aggregation of time
#' @param what a list of topics
#' @param where1 a list of states
#' @param where2 a list of states
hypercube <- function( corpus = qd,
order = "order",
who = "source",
when = "when",
timespan = "week",
what = "what",
where1 = "where1",
where2 = "where2")
{
# prepare data
don<-docvars(corpus)
df<-data.table(id = docid(corpus),
order = don[[order]],
who = don[[who]],
when = don[[when]],
what = don[[what]],
where1 = don[[where1]],
where2 = don[[where2]])
# adjust id
df$id<-paste(df$id,"_",df$order,sep="")
# change time span
df$when<-as.character(cut(as.Date(df$when), timespan, start.on.monday = TRUE))
# unnest where1
df$where1[df$where1==""]<-"_no_"
df<-unnest_tokens(df,where1,where1,to_lower=F)
# unnest where2
df$where2[df$where2==""]<-"_no_"
df<-unnest_tokens(df,where2,where2,to_lower=F)
# unnest what
df$what[df$what==""]<-"_no_"
df<-unnest_tokens(df,what,what,to_lower=F)
# Compute weight of news
newswgt<-df[,list(wgt=1/.N),list(id)]
df <- merge(df,newswgt, by="id")
# ------------------------ Hypercube creation --------------------#
# Aggregate
hc<- df[,.(tags = .N, news=sum(wgt)) ,.(order,who, when,where1,where2, what)]
# Convert date to time
hc$when<-as.Date(hc$when)
# export
return(hc)
}
In order to test the function, we first apply it to our small example of the single news item published by El Watan.
hc_examp<-hypercube( corpus = examp,
order = "order",
who = "who",
when = "when",
timespan = "day",
what = "mobil",
where1 = "states",
where2 = "states")
kable(hc_examp)
| order | who | when | where1 | where2 | what | tags | news |
|---|---|---|---|---|---|---|---|
| 1 | fr_DZA_elwata | 2015-09-16 | no | no | refu | 1 | 1.0000000 |
| 2 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | DEU | SVK | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | AUT | SVK | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | SVK | DEU | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | SVK | AUT | migr | 1 | 0.1111111 |
| 2 | fr_DZA_elwata | 2015-09-16 | SVK | SVK | migr | 1 | 0.1111111 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | BEL | SVK | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | DEU | SVK | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | AUT | SVK | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | BEL | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | BEL | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | DEU | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | DEU | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | AUT | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | AUT | migr | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | SVK | refu | 1 | 0.0312500 |
| 3 | fr_DZA_elwata | 2015-09-16 | SVK | SVK | migr | 1 | 0.0312500 |
| 4 | fr_DZA_elwata | 2015-09-16 | no | no | no | 1 | 1.0000000 |
- order = 1 : the title is described by a single line because we have only one subtopic and no states mentioned. The weight of the line is 1.
- order = 2 : the first sentence of the description is characterized by one subtopic and three different states, which produces 9 lines with a weight of 1/9 = 0.111 news each.
- order = 3 : the second sentence of the description is characterized by two subtopics and four different states, which produces 32 lines with a weight of 1/32 = 0.031 news each.
- order = 4 : the last sentence of the description is characterized by no topics and no states, which produces 1 line with a weight of 1.
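The weights above follow a simple rule: a text unit tagged with k subtopics and n states produces k × n² hypercube lines of weight 1/(k × n²), with empty dimensions counting as a single "_no_" value. A one-line helper (`line_weight` is our own illustrative name) makes the check explicit:

```r
# k subtopics and n detected states -> k * n^2 lines, each of weight 1/(k * n^2);
# a missing dimension ("_no_") behaves like a single value
line_weight <- function(k, n) 1 / (max(k, 1) * max(n, 1)^2)

line_weight(1, 3)  # order = 2 : 1/9
line_weight(2, 4)  # order = 3 : 1/32
line_weight(0, 0)  # order = 1 or 4 : a single line of weight 1
```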
4.3 Application
Of course, it is not interesting to transform a single news item into such a large table. But the procedure is of high interest when we realize the aggregation on a large number of news, because in this case the number of combinations of dimensions is limited and we obtain a synthetic table, called hypercube, that summarizes all the information extracted from the news in a relatively small object. The computation of a hypercube can take a relatively long time, and the memory needed for the intermediary step of disaggregation can be important, but the resulting object is small and well adapted to a large number of exploration and modelling methods.
In practice, the function based on the data.table package appears to be very fast, as we can see in the following example.
hc<-hypercube( corpus = qd,
order = "order",
who = "who",
when = "when",
timespan = "day",
what = "mobil",
where1 = "states",
where2 = "states")
saveRDS(hc,"data/corpus/hc_mycorpus_states_mobil_day.RDS")
paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_mobil_day.RDS")/1000000,3), "Mo")
[1] "Size of resulting file = 0.285 Mo"
We can see that the resulting object is rather small (0.3 MB), which will make it easier to produce visualizations based on the crossing of the different dimensions.
If we want to work only at the month level, the hypercube can be even smaller:
hc<-hypercube( corpus = qd,
order = "order",
who = "who",
when = "when",
timespan = "month",
what = "mobil",
where1 = "states",
where2 = "states")
saveRDS(hc,"data/corpus/hc_mycorpus_states_mobil_month.RDS")
paste("Size of resulting file = ",round(file.size("data/corpus/hc_mycorpus_states_mobil_month.RDS")/1000000,3), "Mo")
[1] "Size of resulting file = 0.149 Mo"
5 Hypercubes exploration
The different dimensions of a hypercube can be analysed through different aggregations of its dimensions, leading to different tables that allow different modes of visualization. Each function is named according to the dimensions that it combines.
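The actual functions are loaded from `pgm/hypernews_functions_V6.R` and are not reproduced in this document, but the simplest of them, `what()`, essentially reduces to a data.table aggregation of the following kind (a schematic reconstruction under our own assumptions, not the original code):

```r
library(data.table)

# Schematic version of what(): share of news associated to the topic.
# Assumes a hypercube with a "what" column holding "_no_" for untagged units
# and a "news" column holding the fractional weights.
what_sketch <- function(hc) {
  tab <- hc[, .(news = sum(news)), by = .(what = what != "_no_")]
  tab$pct <- 100 * tab$news / sum(tab$news)
  tab[order(what)]
}
```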
source("pgm/hypernews_functions_V6.R")
We start here from the situation of a researcher interested in the topic of human mobility, and we load the hypercube elaborated in the previous section. We decide to analyze the topic without distinction between migrants and refugees.
hc <- readRDS("data/corpus/hc_mycorpus_states_mobil_month.RDS")
5.1 WHAT
The first question (WHAT) is the evaluation of the proportion of news related to the topic.
res_what <- what(hc = hc,
subtop = NA,
title = "Topic news")
res_what$table
    what   news       pct
1: FALSE 112482 97.272476
2:  TRUE   3154  2.727524
res_what$plotly
The table indicates that 3154 news items were associated to the topic, which represents 2.73% of the total.
5.2 WHO.WHAT
The second question (WHO.WHAT) explore the variation of interest for the topic in the different media of the corpus.
5.2.1 Example
res_who_what<- who.what(hc=hc,
test = FALSE,
minsamp = 20,
mintest = 5,
title = "Topic news by media - Salience")
kable(res_who_what$table)
| who | trial | success | null.value | estimate | salience | chi2 | p.value | index |
|---|---|---|---|---|---|---|---|---|
| fr_FRA_libera | 28827 | 607 | 0.0273 | 0.02106 | 0.7714286 | 42.08 | 1.00000 | 0.7714286 |
| fr_DZA_elwata | 11581 | 292 | 0.0273 | 0.02521 | 0.9234432 | 1.82 | 0.91137 | 0.9234432 |
| fr_BEL_derheu | 16358 | 349 | 0.0273 | 0.02134 | 0.7816850 | 21.69 | 1.00000 | 0.7816850 |
| fr_FRA_figaro | 22999 | 1005 | 0.0273 | 0.04370 | 1.6007326 | 232.26 | 0.00000 | 1.6007326 |
| fr_BEL_lesoir | 26057 | 652 | 0.0273 | 0.02502 | 0.9164835 | 5.01 | 0.98737 | 0.9164835 |
| fr_DZA_xpress | 9814 | 249 | 0.0273 | 0.02537 | 0.9293040 | 1.30 | 0.87310 | 0.9293040 |
res_who_what$plotly
res_who_what<- who.what(hc=hc,
test = TRUE,
minsamp = 5,
mintest = 1,
title = "Topic news by media - Significance")
res_who_what$plotly
The analysis reveals a clear over-representation of the topic in the French newspaper Le Figaro (4.37% of news) as compared to the other media (2.1 to 2.5%).
5.3 WHEN.WHAT
The third question (WHEN.WHAT) is related to the evolution through time of the interest of all the media of the corpus for the topic.
5.3.1 Example
res_when_what<- when.what(hc=hc,
test=FALSE,
minsamp=10,
mintest=5,
title = "Topic news by month - Salience")
res_when_what$plotly
res_when_what<- when.what(hc=hc,
test=TRUE,
minsamp=10,
mintest=5,
title = "Topic news by month - Significance")
res_when_what$plotly
The analysis reveals clear discontinuities in the timeline of the topic. We start with a low level (0.5 to 1.2%) from January 2014 to March 2015, followed by a sharp jump in April-June 2015 (3 to 5%) and a major peak in September 2015 (15.8% of news). At the end of the period, the level is clearly higher than at the beginning.
5.4 WHERE.WHAT
The fourth question (WHERE.WHAT) analyzes which countries are most associated to the topic of interest. We therefore exclude the news where no countries are mentioned, and we analyze for each country the proportion of news associated to the topic.
5.4.1 Example
map<-readRDS("data/map/world_ctr_4326.Rdata")
hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_")
res_where_what<- where.what(hc=hc2,
test=FALSE,
map = map,
minsamp=10,
mintest =5,
title = "Topic news by states - Salience")
res_where_what$plotly
res_where_what<- where.what(hc=hc2,
test=TRUE,
minsamp=10,
map = map,
mintest =5,
title = "Topic news by states - Significance")
res_where_what$plotly
The analysis reveals that some countries are “specialized” in the topic during the period of observation. For example, 53.5% of the news about Hungary were associated to the question of migrants and refugees, which is obviously related to the mediatization of the wall built by Viktor Orban in 2015. Other countries are characterized, on the contrary, by an under-representation of the topic, like the USA, where the topic is associated to only 0.7% of news. The situation would change after the election of Donald Trump, whose own wall project would dramatically increase the number of news about the USA and migrants.
5.5 WHEN.WHO.WHAT
Despite our limited sample size, we can try to ask more complex questions that combine three dimensions. We can for example examine the synchronization of media through time on the topic of interest (WHEN.WHO.WHAT).
5.5.1 Example
res_when_who_what<- when.who.what(hc=hc,
test = FALSE,
minsamp = 20,
mintest = 5,
title = "Topic news by month and by media - Salience")
res_when_who_what$plotly
res_when_who_what<- when.who.what(hc=hc,
test = TRUE,
minsamp = 20,
mintest = 5,
title = "Topic news by month and by media - Significance")
res_when_who_what$plotly
The figure reveals a global synchronization of the media agenda concerning the topic, especially around the major peak of interest located in September 2015. The first period of crisis, in April 2015, is also visible in all media, with the exception of the Belgian newspaper “Dernière Heure”, which apparently did not cover the dramatic boat sinkings in the Mediterranean more than usual. Another interesting difference can be observed for the two Algerian newspapers, which were characterized by a higher coverage of the topic during the year 2014.
5.6 WHERE.WHO.WHAT
Another combination of the three dimensions consists in exploring whether some countries are more mentioned by some media in relation with the topic of interest. In other words, do we observe a geographic synchronization of the media agenda?
5.6.1 Example
hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_") %>% mutate(who=substr(who,4,6))
res_where_who_what<- where.who.what(hc= hc2,
maxloc= 10,
test=FALSE,
minsamp=5,
mintest=2,
title = "Topic news by media and by states - Salience")
res_where_who_what$plotly
res_where_who_what<- where.who.what(hc= hc2,
maxloc= 10,
test=TRUE,
minsamp=5,
mintest=2,
title = "Topic news by media and by states - Significance")
res_where_who_what$plotly
The analysis reveals that some countries, like Turkey or Greece, are systematically associated to the topic by all media. But other countries, like Syria, show variable patterns of interest in relation with the topic: it is clearly more associated in Algeria, neutral in Belgium and less associated in France.
5.7 WHEN.WHERE.WHAT
The last combination of three dimensions concerns the variation through time of the association between the topic and the countries. It can typically reveal the effect of a dramatic event occurring in one country at a given period of time. Unfortunately, the sample is too limited in size for an in-depth exploration of crises, and we are obliged to limit our example to the comparison of the eight quarters of the period.
5.7.1 Function
5.7.2 Example
hc2<-hc %>% filter(where1 !="_no_", where2 !="_no_") %>% mutate(when=cut(when, breaks="quarter"))
res_when_where_what<- when.where.what(hc=hc2,
maxloc= 10,
test=FALSE,
minsamp=5,
mintest=2,
title = "Topic news by year and by states - Salience")
res_when_where_what$plotly
res_when_where_what<- when.where.what(hc=hc2,
maxloc= 10,
test=TRUE,
minsamp=5,
mintest=2,
title = "Topic news by year and by states - Significance")
res_when_where_what$plotly
The analysis confirms that in the majority of cases the strongest association of countries with the topic took place in 2015, with a major peak in the third quarter (July-September). But some interesting exceptions can be observed, in particular in the case of Italy, which was associated earlier with the question of migrants and refugees. The analysis remains difficult here, however, because of the too coarse level of time aggregation.
Bibliography
Annexes
Infos session
| setting | value |
|---|---|
| version | R version 4.0.2 (2020-06-22) |
| os | macOS Catalina 10.15.7 |
| system | x86_64, darwin17.0 |
| ui | X11 |
| language | (EN) |
| collate | fr_FR.UTF-8 |
| ctype | fr_FR.UTF-8 |
| tz | Europe/Paris |
| date | 2021-12-02 |
| package | ondiskversion | source |
|---|---|---|
| data.table | 1.13.0 | CRAN (R 4.0.2) |
| dplyr | 1.0.2 | CRAN (R 4.0.2) |
| ggplot2 | 3.3.3 | CRAN (R 4.0.2) |
| ggraph | 2.0.4 | CRAN (R 4.0.2) |
| knitr | 1.34 | CRAN (R 4.0.2) |
| lubridate | 1.7.9.2 | CRAN (R 4.0.2) |
| plotly | 4.9.2.2 | CRAN (R 4.0.2) |
| quanteda | 3.0.0 | CRAN (R 4.0.2) |
| RColorBrewer | 1.1.2 | CRAN (R 4.0.2) |
| readr | 1.4.0 | CRAN (R 4.0.2) |
| readtext | 0.80 | CRAN (R 4.0.2) |
| rmarkdown | 2.11 | CRAN (R 4.0.2) |
| rzine | 0.1.0 | gitlab (rzine/package@a94bf55) |
| sf | 0.9.8 | CRAN (R 4.0.2) |
| stringr | 1.4.0 | CRAN (R 4.0.2) |
| tidygraph | 1.2.0 | CRAN (R 4.0.2) |
| tidytext | 0.2.6 | CRAN (R 4.0.2) |
| visNetwork | 2.0.9 | CRAN (R 4.0.2) |
Citation
@Manual{ficheRzine,
title = {Titre de la fiche},
author = {{Auteur.e.s}},
organization = {Rzine},
year = {202x},
url = {http://rzine.fr/},
}